Poster: Moongleam Date: May 13, 2010 9:58am
Forum: feature_films Subject: Re: Public domain findings

| that HathiTrust is kind of a pain in the ass to search!

Yes. The text produced by g@@gle's OCR software is replete with errors. Some of the pages were pure gibberish, so I had to OCR them myself using TopOCR. Parts of pages in some volumes are missing, so my program also searches the indexes for those volumes.

My program uses a fuzzy matching algorithm so that it will find the title even if there are errors in it. Example:

** MEN ARE SUCH FOOLS ** 1938 **

-------------- 1966-A.txt --------------
[No matches.]
-------------- 1966-B.txt --------------
[No matches.]
-------------- 1965-A.txt --------------
HEN ARE SUCH FOOLS, a photoplay In
eight reels by Warner Bros. Pictures.
(C) 8:-1ar38; LP8102. United Artists
Television, Inc. (PWH); 7Apr65;

-------------- 1965-B.txt --------------
[No matches.]
-------------- 1967-A.txt --------------
[No matches.]
-------------- 1967-B.txt --------------
[No matches.]

The 0.056 is the amount of error.
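For anyone curious how this kind of fuzzy title matching might work, here is a minimal sketch in Python. It is not the poster's program, whose algorithm is not described; it assumes a plain Levenshtein edit distance divided by the title length, and the names `edit_distance`, `error_rate`, `find_title`, and the `max_error` threshold are all hypothetical. Under that assumption, one wrong character in the 18-character title gives 1/18, or about 0.056, which is consistent with the figure quoted above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute, or match for free
        prev = curr
    return prev[-1]


def error_rate(title: str, candidate: str) -> float:
    """Edit distance divided by title length (an assumed normalization).

    The title must be non-empty.
    """
    return edit_distance(title.upper(), candidate.upper()) / len(title)


def find_title(title: str, text: str, max_error: float = 0.15):
    """Return (line, error) pairs for lines that start with something close to the title.

    max_error is a hypothetical cutoff; a real tool would tune it.
    """
    hits = []
    for line in text.splitlines():
        window = line.strip()[:len(title)]  # compare only the start of each line
        if not window:
            continue
        err = error_rate(title, window)
        if err <= max_error:
            hits.append((line, err))
    return hits


if __name__ == "__main__":
    sample = (
        "HEN ARE SUCH FOOLS, a photoplay In\n"
        "eight reels by Warner Bros. Pictures.\n"
    )
    for line, err in find_title("MEN ARE SUCH FOOLS", sample):
        print(f"{err:.3f}  {line}")  # prints "0.056  HEN ARE SUCH FOOLS, a photoplay In"
```

Comparing only the start of each line keeps the sketch short and suits catalog entries that open with the title. A title that begins mid-line or wraps across lines would need a sliding window over the whole text instead.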

Poster: skybandit Date: May 13, 2010 7:40pm
Forum: feature_films Subject: Re: Public domain findings

Neat trick! My computer's so dumb I can cheat at Freecell.